Abstract

For many optimization problems it is possible to define a distance metric between problem 
variables that correlates with the likelihood and strength of interactions between the variables. 
For example, one may define a metric so that the dependencies between variables that are closer 
to each other with respect to the metric are expected to be stronger than the dependencies 
between variables that are further apart. The purpose of this paper is to describe a method 
that combines such a problem-specific distance metric with information mined from probabilistic 
models obtained in previous runs of estimation of distribution algorithms with the goal of solving 
future problem instances of similar type with increased speed, accuracy and reliability. While 
the focus of the paper is on additively decomposable problems and the hierarchical Bayesian 
optimization algorithm, it should be straightforward to generalize the approach to other model- 
directed optimization techniques and other problem classes. Compared to other techniques for 
learning from experience put forward in the past, the proposed technique is both more practical 
and more broadly applicable. 



1 Introduction 



Even for optimization problems that are extremely difficult to solve, it may be straightforward
to extract information about important dependencies between variables and other problem
regularities directly from the problem definition (Baluja, 2006; Drezner & Salhi, 2002;
Hauschild & Pelikan, 2010; Stonedahl, Rand, & Wilensky, 2008; Schwarz & Ocenasek, 2000).
Furthermore, when solving many problem instances of similar type, it may be possible to gather
information about variable interactions and other problem features by examining previous runs
of the optimization algorithm, and to use this information to bias optimization of future
problem instances to increase its speed, accuracy and reliability (Hauschild & Pelikan, 2008;
Hauschild, Pelikan, Sastry, & Goldberg, 2011).



The use of information from previous runs to introduce bias into future runs of an evolutionary
algorithm is often referred to as learning from experience (Hauschild & Pelikan, 2008;
Hauschild, Pelikan, Sastry, & Goldberg, 2011; Pelikan, 2002). The use of bias based on the
results of other learning tasks in the same problem domain is also commonplace in machine
learning, where it is referred to as inductive transfer or transfer learning (Pratt, Mostow,
Kamm, & Kamm, 1991; Caruana, 1997). Numerous studies have shown that using prior knowledge and
learning from experience promise improved efficiency and problem solving capabilities (Baluja,
2006; Drezner & Salhi, 2002; Hauschild & Pelikan, 2010; Hauschild, Pelikan, Sastry, & Goldberg,
2011; Rothlauf, 2006; Stonedahl, Rand, & Wilensky, 2008; Schwarz & Ocenasek, 2000). However,
most prior work in this area was based on hand-crafted search operators, model restrictions, or
representations.

This paper describes an approach that combines prior problem-specific knowledge with learning 
from experience. The basic idea of the proposed approach consists of (1) defining a problem-
specific distance metric, (2) analyzing previous EDA models to quantify the likelihood and nature of
dependencies at various distances, and (3) introducing bias into EDA model building based on the 
results of the analysis using Bayesian statistics. One of the key goals of this paper is to develop an 
automated procedure capable of introducing bias based on a distance metric and prior EDA runs, 
without requiring much expert knowledge or hand-crafted model restrictions from the practitioner. 
Furthermore, the proposed approach is intended to be applicable in a more practical manner than 
other approaches to learning from experience. For example, the proposed approach makes it feasible 
to use prior runs on problems of a smaller size to introduce bias when solving problem instances of 
a larger size, and the bias can be introduced even when the importance of dependencies between 
specific pairs of variables varies significantly across the problem domain. Although this paper 
focuses on the hierarchical Bayesian optimization algorithm (hBOA) and additively decomposable 
functions (ADFs), the proposed approach can also be applied to other model-directed optimization 
techniques and other problem types. The paper outlines a framework that can be used to adapt 
the proposed approach to a different context. 

The paper is organized as follows. Section 2 describes hBOA. Section 3 outlines the framework
for introducing bias based on prior runs on similar problems in model-directed optimization.
Section 4 presents the proposed approach to introducing bias into hBOA model building for
additively decomposable problems. Section 5 presents experimental results. Section 6 summarizes
and concludes the paper.



2 Hierarchical BOA 



The hierarchical Bayesian optimization algorithm (hBOA) (Pelikan & Goldberg, 2001;
Pelikan & Goldberg, 2003; Pelikan, 2005) is an estimation of distribution algorithm (EDA)
(Baluja, 1994; Larrañaga & Lozano, 2002; Pelikan et al., 2002; Lozano et al., 2006;
Pelikan et al., 2006; Hauschild & Pelikan, 2011). hBOA works with a population of candidate
solutions represented by fixed-length strings over a finite alphabet. In this paper, candidate
solutions are represented by n-bit binary strings. The initial population of binary strings is
generated at random according to the uniform distribution over candidate solutions. Each
iteration starts by selecting promising solutions from the current population; here binary
tournament selection without replacement is used. Next, hBOA (1) learns a Bayesian network with
local structures for the selected solutions and (2) generates new candidate solutions by
sampling the distribution encoded by the built network (Chickering, Heckerman, & Meek, 1997;
Friedman & Goldszmidt, 1999). To maintain useful diversity in the population, the new candidate
solutions are incorporated into the original population using restricted tournament selection
(RTS) (Harik, 1995). The run is terminated when termination criteria are met. In this paper,
each run is terminated either when the global optimum is found or when a maximum number of
iterations is reached. Since a basic understanding of the probabilistic models used in hBOA is
necessary for the remainder of the paper, the rest of this section discusses the class of
probabilistic models used in hBOA.

hBOA represents probabilistic models of candidate solutions by Bayesian networks with local
structures (Chickering, Heckerman, & Meek, 1997; Friedman & Goldszmidt, 1999). A Bayesian
network is defined by two components: (1) an acyclic directed graph over problem variables
specifying direct dependencies between variables and (2) conditional probabilities specifying
the probability distribution of each variable given the values of the variable's parents. A
Bayesian network encodes a joint probability distribution as
$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \Pi_i)$, where $X_i$ is the ith variable and
$\Pi_i$ are the parents of $X_i$ in the underlying graph.
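
As an illustration of this factorization, the following minimal Python sketch evaluates the joint probability of a small binary network. The three-variable structure, the table layout, and the name `joint_probability` are assumptions made for this example only; hBOA itself stores the conditional probabilities in decision trees rather than in full tables.

```python
from itertools import product

# Hypothetical 3-variable network: X0 has no parents, X1 depends on X0,
# and X2 depends on X0 and X1. parents[i] lists the parents of X_i.
parents = {0: (), 1: (0,), 2: (0, 1)}

# cpt[i][parent_values] = probability that X_i = 1 given the parent values.
cpt = {
    0: {(): 0.6},
    1: {(0,): 0.2, (1,): 0.9},
    2: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.95},
}

def joint_probability(x):
    """Return p(x) as the product of p(x_i | parents of X_i)."""
    p = 1.0
    for i, xi in enumerate(x):
        parent_values = tuple(x[j] for j in parents[i])
        p_one = cpt[i][parent_values]
        p *= p_one if xi == 1 else 1.0 - p_one
    return p

# The factorized probabilities of all 2^3 assignments sum to one.
total = sum(joint_probability(x) for x in product((0, 1), repeat=3))
```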

To represent conditional probabilities of each variable given the variable's parents, hBOA uses 
decision trees. Each internal node of a decision tree specifies a variable, and the subtrees of 
the node correspond to the different values of the variable. Each leaf of the decision tree for a 
particular variable defines the probability distribution of the variable given a condition specified 
by the constraints given by the path from the root of the tree to this leaf (constraints are given by 
the assignments of the variables along this path). 

To build probabilistic models, hBOA typically uses a greedy algorithm that initializes the 
decision tree for each problem variable Xi to a single-node tree that encodes the unconditional 
probability distribution of Xi. In each iteration, the model building algorithm tests how much a 
model would improve after splitting each leaf of each decision tree on each variable that is not 
already located on the path to the leaf. The algorithm executes the split that provides the most 
improvement, and the process is repeated until no more improvement is possible. 

Improvement of the model after a split is often evaluated using the Bayesian-Dirichlet (BDe)
metric with penalty for model complexity. Bayesian measures evaluate the goodness of a Bayesian
network structure given data D and background knowledge $\xi$ as (Cooper & Herskovits, 1992;
Heckerman, Geiger, & Chickering, 1994)

$$p(B \mid D, \xi) = c \, p(B \mid \xi) \, p(D \mid B, \xi), \tag{1}$$

where c is a normalization constant. For the Bayesian-Dirichlet metric, the term
$p(D \mid B, \xi)$ is estimated as (Chickering, Heckerman, & Meek, 1997)

$$p(D \mid B, \xi) = \prod_{i=1}^{n} \prod_{l \in L_i} \frac{\Gamma(m'_i(l))}{\Gamma(m_i(l) + m'_i(l))} \prod_{x_i} \frac{\Gamma(m_i(x_i, l) + m'_i(x_i, l))}{\Gamma(m'_i(x_i, l))}, \tag{2}$$

where $L_i$ is the set of leaves in the decision tree $T_i$ for $X_i$; $m_i(l)$ is the number
of instances in the selected population which end up the traversal through the tree $T_i$ in
the leaf $l$; $m_i(x_i, l)$ is the number of instances that have $X_i = x_i$ and end up the
traversal of the tree $T_i$ in the leaf $l$; $m'_i(l)$ represents the prior knowledge about the
value of $m_i(l)$; and $m'_i(x_i, l)$ represents the prior knowledge about the value of
$m_i(x_i, l)$. Without any prior knowledge, an uninformative prior $m'_i(x_i, l) = 1$ is
typically used. To favor simpler networks over more complex ones, the prior probability of each
network decreases exponentially fast with respect to the description length of this network's
parameters (Friedman & Goldszmidt, 1999; Pelikan, 2005):

$$p(B) = c \, 2^{-0.5 \left( \sum_i |L_i| \right) \log_2 N}, \tag{3}$$

where c is a normalization constant required for the prior probabilities of all network
structures to sum to one.
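
To make the effect of this complexity prior concrete, the sketch below computes the logarithm of the unnormalized prior in eq. (3) before and after a single split. It assumes that N denotes the population size (this section does not define it) and that a split on a binary variable replaces one leaf by two; the function name and the numbers are hypothetical.

```python
import math

def log2_complexity_prior(leaf_counts, population_size):
    """log2 of the unnormalized prior in eq. (3): -0.5 * (sum_i |L_i|) * log2(N)."""
    return -0.5 * sum(leaf_counts) * math.log2(population_size)

# Three single-leaf decision trees and N = 1024 (assumed values).
before = log2_complexity_prior([1, 1, 1], 1024)

# A binary split replaces one leaf by two, adding one leaf to the first tree.
after = log2_complexity_prior([2, 1, 1], 1024)

# Each such split lowers the log-prior by log2(N) / 2.
penalty_per_split = before - after
```

Under these assumptions the penalty per split is independent of where the split is made, which is the property that the distance-based prior of Section 4.3 replaces.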



3 Bias Based on Previous EDA Runs 



Building an accurate probabilistic model in hBOA and other EDAs based on complex probabilistic
models can be time consuming and it may require rather large populations of solutions. That is
why much effort has been put into enhancing efficiency of model building in EDAs and improving
quality of EDA models even with smaller populations of candidate solutions (Baluja, 2006;
Hauschild & Pelikan, 2008; Hauschild & Pelikan, 2009; Hauschild, Pelikan, Sastry, & Goldberg,
2011; Mühlenbein & Mahnig, 2002). Learning from experience (Hauschild & Pelikan, 2008;
Hauschild & Pelikan, 2009; Hauschild, Pelikan, Sastry, & Goldberg, 2011; Pelikan, 2005)
represents one approach to dealing
with this issue. In learning from experience, models discovered by EDAs in previous runs are mined
to identify regularities and the discovered regularities are used to bias model building in future 
runs on problems of similar type. Since learning model structure is often the most challenging 
task in model building, learning from experience often focuses on identifying regularities in model 
structure and using these regularities to bias structural learning in future runs. 

It is straightforward to collect statistics on the most frequent dependencies in EDA models. 
Nonetheless, for the collected statistics to be useful, it is important to ensure that the statistics are 
meaningful with respect to the problem being solved. For example, consider optimization of NK
landscapes (Kauffman, 1989), in which the fitness function is defined as the sum of n subfunctions
$\{f_i\}_{i=1}^{n}$, and the subfunction $f_i$ is applied to the ith bit and its k neighbors. The neighbors of
each bit are typically chosen at random for each problem instance. Therefore, if we consider 
1,000 problem instances of NK landscapes, looking at the percentage of models that included a 
dependency between the first two bits for the first 999 instances will not say much about the 
likelihood of the same dependency for the last instance. A similar observation can be made for 
many other important problem classes, such as MAXSAT or the quadratic assignment problem. 
That is why it is important to develop a more general framework that allows one to learn and use 
statistics on dependencies in EDA models across a range of problem domains of different structure 
and properties. In the remainder of this section we describe one such framework. 

To formalize the proposed framework for identifying structural regularities in EDA models,
let us define a set of m dependency categories $D = \{D_1, \ldots, D_m\}$ and denote the
background knowledge about the problem by $\xi$. Then, we can define a function
$\gamma(i, j, \xi)$ that, given $\xi$, maps any dependency $(X_i, X_j)$ covered by the
probabilistic model into one of the m categories so that $\gamma(i, j, \xi) = k$ if and only if
$(X_i, X_j) \in D_k$. Two straightforward possibilities for defining the function $\gamma$
were proposed by Hauschild et al. (2008, 2009): (1) each pair of problem variables $X_i$ and
$X_j$ defines a special category, and (2) categories are defined using a discretization of a
problem-specific distance metric between variables. The first approach is useful especially
when solving a number of instances of a problem where each variable has a fixed meaning across
the entire set of instances; this is the case, for example, in spin glasses defined on a regular
lattice, where every pair of variables in the lattice can be assigned a special category because
the structure of the problem does not change from one instance to another (Hauschild, Pelikan,
Sastry, & Lima, 2009). The latter approach is useful especially when one can define a distance
metric on variables so that the distance between two variables correlates strongly with the
likelihood or strength of their interaction; for example, one may define a distance metric such
that variables that interact more strongly are closer to each other according to the metric.
Such a distance metric can be defined, for example, in the quadratic assignment problem, the
traveling salesman problem or, more generally, classes of additively decomposable functions.
While these two approaches are applicable to many important classes of problems, one may
envision many other approaches based on this framework. The key issue in defining $\gamma$ is
that the categories should be related to the problem, so that each category contains pairs of
variables that have a lot in common and that can be expected to be either correlated or
independent most of the time.
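
The two ways of assigning dependencies to categories can be sketched as follows. This is an illustration only: the function names, the use of ordered pairs in option (1), the bin edges, and the ring-shaped distance used in the example are assumptions, not part of the framework itself.

```python
def gamma_pairwise(i, j, n):
    """Option (1): every ordered pair (i, j) of variables is its own category."""
    return i * n + j

def gamma_distance(i, j, distance, bin_edges):
    """Option (2): the category is the index of the distance bin containing D(i, j).

    bin_edges is an increasing list of upper bounds; distances above the last
    bound fall into one extra category.
    """
    d = distance[i][j]
    for k, upper in enumerate(bin_edges):
        if d <= upper:
            return k
    return len(bin_edges)

# Hypothetical distance matrix: six variables arranged on a ring.
n = 6
ring = [[min(abs(i - j), n - abs(i - j)) for j in range(n)] for i in range(n)]

near = gamma_distance(0, 1, ring, bin_edges=[1, 2])   # distance 1
mid = gamma_distance(0, 4, ring, bin_edges=[1, 2])    # distance 2
far = gamma_distance(0, 3, ring, bin_edges=[1, 2])    # distance 3
```

With option (2), statistics gathered for a category remain meaningful when the variable indices are permuted or the problem size changes, as long as the distances are computed for each instance.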

The statistics obtained from previous EDA models can be used to bias the search operators of 
model-directed optimization methods using either a soft bias or a hard bias. A soft bias allows one 
to define preference to some models using a prior distribution over network structures or partial 
variable assignments (Schwarz & Ocenasek, 2000; Hauschild & Pelikan, 2009). A hard bias encodes
hard restrictions on model structure or variable assignments, restricting the class of allowable
models (Mühlenbein & Mahnig, 2002; Baluja, 2006; Hauschild & Pelikan, 2008). While in most
prior work on bias in EDAs the bias was based on expert knowledge, in learning from experience 
the focus is on automated learning of a bias from past EDA runs. 

In this paper we describe one way of using the above framework to facilitate learning from 
experience in hBOA for additively decomposable problems based on a problem-specific distance 
metric. However, note that the framework can be applied to other EDAs based on graphical 
models. 



4 Distance-Based Bias 



4.1 Additively Decomposable Functions 

For many optimization problems, the objective function (fitness function) can be expressed in the
form of an additively decomposable function (ADF) of m subproblems:

$$f(X_1, \ldots, X_n) = \sum_{i=1}^{m} f_i(S_i), \tag{4}$$

where $(X_1, \ldots, X_n)$ are the problem's decision variables, $f_i$ is the ith subfunction,
and $S_i \subseteq \{X_1, X_2, \ldots, X_n\}$ is the subset of variables contributing to $f_i$.
While there may often exist multiple ways of decomposing the problem using additive
decomposition, one would typically prefer decompositions that minimize the sizes of the subsets
$\{S_i\}$. It is of note that the difficulty of ADFs is not
fully determined by the order of subproblems, but also by the definition of the subproblems and 
their interaction. In fact, there exist a number of NP-complete problems that can be formulated as 
ADFs with subproblems of order 2 or 3, such as MAXSAT for 3CNF formulas. On the other hand, 
one may easily define ADFs with subproblems of order n that can be solved by a simple bit-flip 
hill climbing in low-order polynomial time. 
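
A minimal sketch of evaluating an ADF in the sense of eq. (4) is given below. The representation of each subproblem as a pair (tuple of variable indices, subfunction) and the two example subfunctions are assumptions chosen for illustration.

```python
def evaluate_adf(x, subproblems):
    """Sum f_i over the values of the variables in S_i, as in eq. (4)."""
    return sum(f(tuple(x[v] for v in subset)) for subset, f in subproblems)

# Hypothetical decomposition with two overlapping subproblems:
# f_1 counts the ones in (X0, X1, X2); f_2 rewards X2 != X3.
subproblems = [
    ((0, 1, 2), lambda s: sum(s)),
    ((2, 3), lambda s: 1 if s[0] != s[1] else 0),
]

value = evaluate_adf([1, 0, 1, 0], subproblems)
```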



4.2 Measuring Variable Distances for ADFs 

The definition of a distance between two variables of an ADF used in this paper follows
Hauschild and Pelikan (2008) and Hauschild et al. (2011). Given an additively decomposable
problem, we define the distance between two variables using a graph G of n nodes, one node per
variable. For any two variables $X_i$ and $X_j$ in the same subset $S_k$, that is,
$X_i, X_j \in S_k$, we create an edge in G between the nodes $X_i$ and $X_j$. Denoting by
$l_{i,j}$ the number of edges along the shortest path between $X_i$ and $X_j$ in G, we define
the distance between two variables as

$$D(X_i, X_j) = \begin{cases} l_{i,j} & \text{if a path between } X_i \text{ and } X_j \text{ exists,} \\ n & \text{otherwise.} \end{cases}$$

The above distance measure makes variables in the same subproblem close to each other, whereas
for the remaining variables, the distances correspond to the length of the chain of subproblems
that relate the two variables. The distance is maximal for variables that are completely
independent (the value of a variable does not influence the contribution of the other variable
in any way).

Figure 1: (a) Nearest-neighbor NK. (b) 2D spin glass. Dependencies between variables that are
closer to each other are stronger than the dependencies between other variables. Furthermore,
the proportion of splits capturing dependencies between variables at a given distance changes
only little between instances of various sizes. This indicates that the statistics acquired
from models of one size should be useful for biasing hBOA runs on problems of another size. The
results are based on 979,020 models obtained from hBOA on 1,000 unique problem instances of
each problem, 10 runs per instance.
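
The distance between variables defined in this section can be computed with one breadth-first search per variable. The sketch below is one possible implementation under stated assumptions: variables are identified by indices 0 to n-1, subsets are given as tuples of indices, and the distance of a variable to itself is set to 0, a convention the text does not specify.

```python
from collections import deque
from itertools import combinations

def adf_distances(n, subsets):
    """Return an n x n matrix of distances between variables of an ADF.

    Entry [i][j] is the number of edges on the shortest path between i and j
    in the graph that links every two variables sharing a subset, or n if no
    path exists.
    """
    adjacent = [set() for _ in range(n)]
    for subset in subsets:
        for a, b in combinations(subset, 2):
            adjacent[a].add(b)
            adjacent[b].add(a)

    dist = [[n] * n for _ in range(n)]  # n doubles as the "not reached" marker
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacent[u]:
                if dist[source][v] == n:  # a real path has at most n - 1 edges
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist

# Hypothetical ADF on six variables; variable 5 appears in no subproblem.
D = adf_distances(6, [(0, 1, 2), (2, 3), (3, 4)])
```

The cost is O(n(n + e)) for a graph with e edges, which is negligible compared with an hBOA run for the problem sizes considered here.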

Since interactions between problem variables are encoded mainly in the subproblems of the 
additive problem decomposition, the above distance metric should typically correspond closely to 
the likelihood of dependencies between problem variables in probabilistic models discovered by 
EDAs. Specifically, the variables located closer with respect to the metric should more likely 
interact with each other. Figure 1 illustrates this on two ADFs discussed later in this paper: the
NK landscape with nearest neighbor interactions and the two-dimensional Ising spin glass (for a
description of these problems, see Section 5.1). The figure analyzes probabilistic models discovered
by hBOA in 10 independent runs on each of the 1,000 random instances for each problem and 
problem size. For a range of distances d between problem variables, the figure shows the proportion 
of splits on a variable located at distance d. The results clearly support the fact that hBOA models 
indicate strongest dependencies between variables located close to each other according to the 
aforementioned metric and that there is a clear correlation between the distance metric and the 
likelihood of dependencies. Furthermore, the figure indicates that the likelihood of dependencies 
at a specific distance does not change much from one problem size to another, indicating that the 
bias based on these statistics should be applicable across a range of problem sizes. 

It is important to note that other approaches may be envisioned for defining a distance metric
for ADFs. For example, a weight may be added on each edge that would decrease with the number
of subsets that contain the two connected variables. Another interesting possibility would be to
consider the subfunctions themselves in measuring the distances, so that only correlations that lead
to nonlinearities are considered or that some correlations are given a priority over others. Finally,
the distance of variables may depend on the problem definition itself, not on the decomposition
only. For example, in the quadratic assignment problem, a distance between two facility locations is
specified directly by the problem instance. The key is to use problem-specific information to specify 
a distance metric so that the distance between a pair of variables correlates with the likelihood or 
strength of their interaction. 



4.3 Using Distance-Based Bias in hBOA 

The basic idea of incorporating the distance-based bias based on prior runs into hBOA is
inspired mainly by the work of Hauschild et al. (Hauschild & Pelikan, 2009). Hauschild et al.
proposed to incorporate learning from experience into hBOA by modifying prior probabilities of
network structures using the statistics that capture the number of splits on each variable in
the decision tree for each other variable in past hBOA runs on similar problems. Nonetheless,
the approach of Hauschild et al. (Hauschild & Pelikan, 2009) is only applicable to problems
where the strength of interactions between any two variables is not expected to change much
from instance to instance. That is why this approach can only be applied in a limited set of
problem domains and it is difficult to use this approach when problem size is not fixed in all
runs. In this paper, we propose to capture the nature of dependencies between variables with
respect to their distance using the distance metric defined for ADFs or another distance
metric. This allows one to apply the technique in more problem domains and also allows models
from problem instances of one size to be useful in solving problem instances of different sizes
(see Figure 1).

Recall that the BDe metric used to evaluate probabilistic models in hBOA contains two parts:
(1) the prior probability $p(B \mid \xi)$ of the network structure B, and (2) the probability
$p(D \mid B, \xi)$ of the data (population of selected solutions) given B. Prior probabilities
of network structures are typically set to represent the uniform distribution over admissible
network structures or to provide a bias toward simple models regardless of the problem.
However, the prior probability distribution of network structures can also be used to specify
preferable structures. In this paper, we will use the prior probability distribution of network
structures to introduce bias toward models that resemble models obtained in previous runs on
problems of similar type with the focus on distance-based statistics. An analogous approach can
be used to incorporate bias into hBOA for a different mapping $\gamma$ of pairs of variables
into dependency categories $\{D_i\}$.

Let us assume a set M of hBOA models from prior hBOA runs on similar ADFs. Before applying the
bias by modifying the prior probability distribution of models in hBOA, the models in M are
first processed to generate data that will serve as the basis for introducing the bias. The
processing starts by analyzing the models in M to determine the number $s(m, d, j)$ of splits
on any variable $X_i$ such that $D(X_i, X_j) = d$ in a decision tree $T_j$ for variable $X_j$
for a model $m \in M$. Then, the values $s(m, d, j)$ are used to compute the probability
$P_k(d, j)$ of a kth split on a variable at distance d from $X_j$ in the decision tree $T_j$
given that $k - 1$ such splits were already performed:

$$P_k(d, j) = \frac{|\{m \in M : s(m, d, j) \geq k\}|}{|\{m \in M : s(m, d, j) \geq k - 1\}|}. \tag{5}$$

Given the terms $P_k(d, j)$, we can now define the prior probability of a network B as

$$p(B \mid \xi) = c \prod_{d=1}^{n} \prod_{j=1}^{n} \prod_{k=1}^{n_s(d,j)} P_k^{\kappa}(d, j), \tag{6}$$

where $n_s(d, j)$ denotes the number of splits on any variable $X_i$ such that
$D(X_i, X_j) = d$ in $T_j$, $\kappa > 0$ is used to tune the strength of bias (the strength of
bias increases with $\kappa$), and c is a normalization constant. Since model quality is
typically evaluated on a logarithmic scale, so is the contribution of any particular split. The
main difference from the standard version of the BDe metric with the complexity penalty is that
instead of reducing the metric by the additional complexity of $\log_2(N)/2$ for each new
split, we add the corresponding term $\kappa \log_2 P_k(d, j)$, which is never positive.
Therefore, the computation of the change of the prior probability of the network structure can
still be done in constant time. Of course, the change in $p(D \mid B, \xi)$ requires
computation of marginal frequencies, so it cannot be done in constant time.
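
The statistics of eq. (5) and the resulting per-split change of the log-prior can be sketched as follows for one fixed pair (d, j). The function names, the example counts, and the handling of a zero probability or empty denominator (treated here as forbidding the split) are assumptions; the text does not state how such cases are handled.

```python
import math

def split_probabilities(split_counts, max_k):
    """Estimate P_k(d, j) for k = 1..max_k from the counts s(m, d, j).

    split_counts holds one count per model m in M for a fixed (d, j).
    An entry is None when no model has at least k - 1 such splits.
    """
    probs = {}
    for k in range(1, max_k + 1):
        at_least_k = sum(1 for s in split_counts if s >= k)
        at_least_k_minus_1 = sum(1 for s in split_counts if s >= k - 1)
        probs[k] = at_least_k / at_least_k_minus_1 if at_least_k_minus_1 else None
    return probs

def log2_prior_change(probs, k, kappa):
    """Change of log2 p(B | xi) caused by the kth split: kappa * log2 P_k(d, j)."""
    p = probs[k]
    if p is None or p == 0.0:
        return float("-inf")  # assumption: an unseen split is forbidden
    return kappa * math.log2(p)

# Hypothetical counts s(m, d, j) from eight prior models.
counts = [0, 1, 1, 2, 2, 2, 0, 1]
P = split_probabilities(counts, max_k=3)
change = log2_prior_change(P, k=2, kappa=3)
```

Once the table of probabilities has been computed from M, each lookup during model building is a constant-time operation, in line with the remark above.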

It is important to note that the prior probability of hBOA models defined in eq. (6) is
certainly not the only possible approach to incorporating learning from experience using
distance-based statistics into hBOA. The main source of inspiration for the proposed approach
is the work on incorporating bias in learning Bayesian networks using Bayesian metrics
(Heckerman, Geiger, & Chickering, 1994) and the prior work on learning from experience in
hBOA by Hauschild et al. (Hauschild & Pelikan, 2009). The experimental results presented in the
next section confirm that this approach leads to substantial speedups in both problem classes con- 
sidered in this paper and preliminary experiments in other problem domains including MAXSAT 
and minimum vertex cover indicate that substantial speedups can be obtained also in other problem 
classes defined as ADFs. 



5 Experiments



5.1 Test Problems

To test the proposed approach to biasing hBOA model building, we consider two problem 
classes: shuffled nearest-neighbor NK landscapes and two-dimensional spin glasses. Both these 
problem classes were shown to be challenging for conventional genetic algorithms and many 
other optimization techniques due to the rugged landscape, strong epistasis, and complex
structure of interactions between problem variables (Kauffman, 1989; Young, 1998; Pelikan,
2010; Pelikan & Hartmann, 2006). However, for both problem classes, it is straightforward to
generate a large number of problem instances with known optima. For each problem class and problem
size, we use 1,000 unique problem instances; the reason for using such a large number of instances is 
that for these problem classes, algorithm performance often varies substantially from one instance 
to another and the results would thus be unreliable if only a few instances were used. 



An NK fitness landscape (Kauffman, 1989) is fully defined by the following components: (1) the
number of bits, n, (2) the number of neighbors per bit, k, (3) a set of k neighbors $\Pi(X_i)$
of the ith bit for every $i \in \{1, \ldots, n\}$, and (4) a subfunction $f_i$ defining a real
value for each combination of values of $X_i$ and $\Pi(X_i)$ for every
$i \in \{1, \ldots, n\}$. Typically, each subfunction is defined as a lookup table. The
objective function $f_{nk}$ to maximize is defined as
$f_{nk}(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} f_i(X_i, \Pi(X_i))$. The difficulty of
optimizing NK landscapes depends on all components defining an NK problem instance. In this
paper, we consider nearest-neighbor NK landscapes, in which neighbors of each bit are
restricted to the k bits that immediately follow this bit. The neighborhoods wrap around; thus,
for bits which do not have k bits to the right, the neighborhood is completed with the first
few bits of solution strings. The reason for restricting neighborhoods to nearest neighbors was
to ensure that the problem instances can be solved in polynomial time even for k > 1 using
dynamic programming (Pelikan, 2010). The subfunctions are represented by look-up tables (a
unique value is used for each instance of a bit and its neighbors), and each entry in the
look-up table is generated with the uniform distribution from [0, 1). The used class of NK
landscapes with nearest neighbors is thus the same as that in Pelikan (2010). In all
experiments, we use k = 5 and $n \in \{100, 150, 200\}$. For each n, we use 1,000 unique,
independently generated instances; overall, 3,000 unique instances of NK landscapes were tested.
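
For concreteness, a nearest-neighbor NK landscape with wrap-around neighborhoods can be sketched as below. The function names, the seeding, and the bit order used to index the look-up tables are assumptions; the sketch also omits the shuffling of variable positions used in the experiments and is not the instance generator of Pelikan (2010).

```python
import random

def make_nk_instance(n, k, seed=0):
    """Random look-up tables: tables[i][pattern] is drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]

def nk_fitness(x, tables, k):
    """Sum f_i over bit i and the k bits that follow it, wrapping around."""
    n = len(x)
    total = 0.0
    for i in range(n):
        index = 0
        for offset in range(k + 1):
            # Bit i is the most significant bit of the table index.
            index = (index << 1) | x[(i + offset) % n]
        total += tables[i][index]
    return total

tables = make_nk_instance(n=10, k=2, seed=1)
f_zeros = nk_fitness([0] * 10, tables, k=2)
```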

Ising spin glasses are prototypical models for disordered systems (Young, 1998). A simple model
to describe a finite-dimensional Ising spin glass is typically arranged on a regular 2D or 3D
grid where each node i corresponds to a spin $s_i$ and each edge $\langle i, j \rangle$
corresponds to a coupling between two spins $s_i$ and $s_j$. Each edge has a real value
$J_{i,j}$ associated with it that defines the relationship between the two connected spins. To
approximate the behavior of the large-scale system, periodic boundary conditions are often used
that introduce a coupling between the first and the last elements in each dimension. For the
classical Ising model, each spin $s_i$ can be in one of two states: $s_i = +1$ or $s_i = -1$.
Given a set of coupling constants $J_{i,j}$ and a configuration of spins C, the energy can be
computed as $E(C) = -\sum_{\langle i,j \rangle} s_i J_{i,j} s_j$, where the sum runs over all
couplings $\langle i, j \rangle$. Here the task is to find a spin configuration for a given set
of coupling constants that minimizes the energy of the spin glass. The states with minimum
energy are called ground states. The spin configurations are encoded with binary strings where
each bit specifies the value of one spin (0 for a spin +1, 1 for a spin -1). One generally
analyzes a large set of random spin glass instances for a given distribution of the spin-spin
couplings. In this paper we consider the ±J spin glass, where each spin-spin coupling is set
randomly to either +1 or -1 with equal probability. We use instances arranged on square grids
of sizes 10 × 10, 12 × 12, 14 × 14, 16 × 16, 18 × 18 and 20 × 20 spins; that is, the problem
sizes range from 100 to 400 spins. We consider periodic boundary conditions. For each problem
size, we use 1,000 unique, independently generated problem instances; overall, 6,000 unique
instances of the 2D spin glass were tested. All instances were obtained from the Spin Glass
Ground State Server (Spin Glass Ground State Server, 2004).
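
The energy of a 2D ±J spin glass with periodic boundary conditions can be evaluated as in the sketch below. The data layout (one right and one down coupling stored per spin), the function names, and the random generator are assumptions for illustration; the instances used in the paper came from the Spin Glass Ground State Server, not from this generator. The layout assumes a grid side of at least 3 so that every coupling is distinct.

```python
import random

def make_pm_j_couplings(side, seed=0):
    """Random +1/-1 couplings on a side x side torus: (right, down) per spin."""
    rng = random.Random(seed)
    return {
        (r, c): (rng.choice((-1, 1)), rng.choice((-1, 1)))
        for r in range(side)
        for c in range(side)
    }

def energy(bits, couplings, side):
    """E(C) = -sum of s_i * J_ij * s_j over all couplings; bit 0 is spin +1."""
    def spin(r, c):
        return 1 - 2 * bits[(r % side) * side + (c % side)]

    total = 0
    for (r, c), (j_right, j_down) in couplings.items():
        total += spin(r, c) * j_right * spin(r, c + 1)
        total += spin(r, c) * j_down * spin(r + 1, c)
    return -total

# Sanity check on a ferromagnet (all couplings +1): aligned spins give -2 * side^2.
side = 4
ferro = {(r, c): (1, 1) for r in range(side) for c in range(side)}
e_aligned = energy([0] * (side * side), ferro, side)
```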



5.2 10-Fold Crossvalidation 

To ensure that the same problem instances were not used for defining the bias as well as for testing 
it, 10-fold crossvalidation was used. For each problem size and each problem, 1,000 random problem 
instances were used in the experiments. The 1,000 instances in each set were randomly split into 
10 equally sized subsets of 100 instances each. In each round of crossvalidation, 1 subset of 100 
instances was left out and hBOA was run on the remaining 9 subsets of 900 instances total. The 
runs on the 9 subsets produced a number of models that were analyzed in order to obtain the 
probabilities Pk(d,j) for all d, j, and k. The bias based on the obtained values of Pk(d,j) was 
then used in hBOA runs on the remaining subset that was left out. The same procedure was 
repeated for each subset; overall, 10 rounds of crossvalidation were performed for each set of 1,000 
instances. Each problem instance was used exactly once in the test of the proposed approach to 
biasing hBOA models and in every test, models used to generate statistics for hBOA bias were 
obtained from hBOA runs on different problem instances. 
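
The splitting scheme described above can be sketched as a small generator. The name `ten_fold_rounds` and the seeding are hypothetical, and the sketch assumes that the number of instances is divisible by the number of folds, as is the case for 1,000 instances and 10 folds.

```python
import random

def ten_fold_rounds(instances, folds=10, seed=0):
    """Yield (training, test) lists; each instance is in exactly one test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // folds
    subsets = [shuffled[f * size:(f + 1) * size] for f in range(folds)]
    for f in range(folds):
        # Models for the bias come from all subsets except the one left out.
        training = [x for g, s in enumerate(subsets) if g != f for x in s]
        yield training, subsets[f]

rounds = list(ten_fold_rounds(range(1000)))
```

In each round, hBOA without bias would be run on the training instances to collect models, and hBOA with the resulting bias would then be run on the test instances.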

While the experiments were performed across a variety of computer architectures and 
configurations, it was always ensured that the base case with no bias and the case with bias were both 
run on the same computational node, so the results of the two runs can be compared 
against each other with respect to the actual CPU time. 



5.3 Experimental Setup 

The maximum number of iterations for each problem instance was set to the overall number of bits 
in the problem; according to preliminary experiments, this upper bound substantially exceeded 
the actual number of iterations required to solve each problem. Each run was terminated either 
when the global optimum was found, when the population consisted of copies of a single candidate 
solution, or when the maximum number of iterations was reached. For each problem instance, we 
used bisection (Sastry, 2001; Pelikan, 2005) to ensure that the population size was within 5% of 
the minimum population size to find the optimum in 10 out of 10 independent runs. 
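
A minimal sketch of such a bisection is given below. It assumes the common scheme of doubling the population size until the runs succeed and then bisecting the resulting interval; the exact variant used in the cited works may differ. The callable `succeeds` is a hypothetical stand-in for "hBOA with this population size finds the optimum in 10 out of 10 independent runs".

```python
def bisect_population_size(succeeds, start=10, tolerance=0.05):
    """Return a population size within `tolerance` of the smallest one that works.

    `succeeds(n)` is a hypothetical callable that reports whether population
    size n is sufficient. If `start` already succeeds, it is returned as is.
    """
    low, high = start, start
    # Doubling phase: grow the population until the runs succeed.
    while not succeeds(high):
        low, high = high, 2 * high
    # Bisection phase: shrink the interval (low fails, high succeeds).
    while high - low > 1 and (high - low) / low > tolerance:
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid
        else:
            low = mid
    return high

# Toy stand-in: pretend any population of at least 137 is sufficient.
print(bisect_population_size(lambda n: n >= 137))
```

In practice the success test is stochastic, so the result is an estimate of the minimum adequate population size rather than an exact value.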

Bit-flip hill climbing (HC) is incorporated into hBOA to improve its performance. HC takes a 
candidate solution represented by an n-bit binary string as input. Then, it repeatedly performs the 
one-bit change that leads to the maximum improvement of solution quality. HC is terminated when 
no single-bit flip improves solution quality and the solution is thus locally optimal. Here, HC is 
used to improve every solution in the population before the evaluation is performed. Without HC, 
the number and size of problem instances would have to be substantially reduced due to the increased 
computational requirements. However, preliminary results indicate that even without HC the 
benefits of the proposed approach would be substantial. 
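
The hill climber described above can be sketched as follows. The function name and the generic `fitness` callable are assumptions for illustration. For simplicity the sketch re-evaluates the whole solution after each trial flip, whereas an implementation for additively decomposable problems would typically update only the subproblems that contain the flipped bit.

```python
def hill_climb(bits, fitness):
    """Steepest-ascent bit-flip hill climbing for a maximization problem.

    Returns the locally optimal solution and the number of HC steps taken.
    """
    bits = list(bits)
    current = fitness(bits)
    steps = 0
    while True:
        best_gain, best_pos = 0, None
        for pos in range(len(bits)):
            bits[pos] ^= 1                    # try flipping one bit
            gain = fitness(bits) - current
            bits[pos] ^= 1                    # undo the trial flip
            if gain > best_gain:
                best_gain, best_pos = gain, pos
        if best_pos is None:                  # no single flip improves
            return bits, steps
        bits[best_pos] ^= 1                   # commit the best flip
        current += best_gain
        steps += 1

# Toy example with the number of ones as the fitness function.
print(hill_climb([0, 1, 0, 0], sum))
```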

The proposed approach to distance-based bias in hBOA is tested for k ∈ {1, 2, 3, 4, 5, 6, 7} to 
assess how the strength of the bias represented by k affects hBOA performance. 

To evaluate hBOA performance, we focus on (1) the execution time per run, (2) the number of 
steps of HC, (3) the number of evaluations, and (4) the required population size. The steps of HC 
are not counted as evaluations in order to distinguish between evaluations and HC steps, because 
for many additively decomposable problems, performing a HC step is much less computationally 
expensive than evaluating a solution. To evaluate the benefits of distance-based bias, the paper 
uses multiplicative speedups, where the speedup is defined as a multiplicative factor by which a 
particular complexity measure improves by using the distance-based bias compared to the base 
case with no distance-based bias. For example, an execution-time speedup of 2 indicates that the 
bias allowed hBOA to find the optimum twice as fast as without the bias. Although the code could 
be further optimized for efficiency, the primary focus of our experiments concerning the execution 
times was on the speedups of the CPU times rather than their absolute values. We have used 
the most efficient implementation of hBOA available for the base case with no bias and we only 
modified it for the remaining cases to incorporate the bias. 
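
As a small worked example of this definition, the sketch below computes the speedup for each of the four measures as the value without the bias divided by the value with the bias. The function name, the dictionary keys, and all numbers are hypothetical and serve only to illustrate the arithmetic; they are not experimental results.

```python
def speedups(base, biased):
    """Multiplicative speedup per measure: value without bias / value with bias."""
    return {name: base[name] / biased[name] for name in base}

# Hypothetical measurements for one run without and one run with the bias.
base = {"cpu_seconds": 12.0, "evaluations": 5000,
        "hc_steps": 40000, "population_size": 800}
biased = {"cpu_seconds": 6.0, "evaluations": 4000,
          "hc_steps": 32000, "population_size": 500}

for name, value in speedups(base, biased).items():
    print(f"{name}: {value:.2f}")
```

A value above 1 means the bias improved the measure, and a value below 1 means the bias made it worse.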

5.4 Results 



Fig. 2(a) shows the effects of k on the multiplicative speedups with respect to the execution time, 
the number of evaluations, the number of HC steps, and the population size. The results confirm 
that, for adequate values of k, the speedups in terms of execution time are substantial for both 
NK landscapes and 2D spin glasses; the maximum speedup for NK landscapes was over 2.26, 
whereas for spin glasses it was over 1.66. For NK landscapes, the speedups in terms of execution 
time grow both with problem size n and with k, and they can be expected to increase further for 
even larger values of n or k. For spin glasses, the speedups seem nearly independent of problem 
size, and the best speedups are obtained for k = 3. The multiplicative speedups in terms of the number 
of evaluations, the number of HC steps and the population size indicate that the reduction in the 
population sizes appears to be one of the most important factors reducing the overall computational 
cost, although improvements can be observed also in most other statistics. 



Fig. 2(b) shows the speedups obtained with respect to the problem size for a range of values of 
k; these results are useful for visualizing the relationship between the problem size and the speedups 
obtained with the distance-based bias. The results confirm that for NK landscapes, the speedups 
appear to grow at least linearly with the problem size, regardless of the value of k. However, for the 
2D spin glass, the speedups fluctuate around the same value for all problem sizes. The speedups 
obtained on NK landscapes are thus not only better, but they further improve with problem size, 
unlike for the 2D spin glass. On the one hand, one may argue that this is due to the fact that NK 
landscapes with nearest neighbor interactions have a simpler structure than the 2D spin glass due 
to the short-range interactions. On the other hand, the interactions in 2D spin glasses are of 
much smaller order (subproblems in problem decomposition have 2 bits each instead of 6). We 
are currently evaluating the distance-based bias on other classes of nearly decomposable problems 
such as MAXSAT and the minimum vertex cover in order to provide more empirical evidence that 
would help explain when the distance-based bias works better and when it has limitations. 

In summary, Fig. 2 provides solid empirical evidence that the speedups obtained are substantial 
and that the proposed approach to learning from experience is useful in practice. 



6 Summary and Conclusions 

This paper introduced a practical approach to incorporating bias in estimation of distribution 
algorithms (EDAs) based on models built in previous EDA runs on problems of similar type. The 
approach was demonstrated on the hierarchical Bayesian optimization algorithm (hBOA) and 
additively decomposable functions, although the framework can also be applied to other EDAs based 
on graphical models and other problem types. For example, it should be straightforward to adapt 
this framework to the extended compact genetic algorithm (Harik, 1999) or classes of facility 
location problems. The key idea of the proposed approach was to define a distance metric that 
corresponds to the likelihood of dependencies between variables, and to use the statistics on 
dependencies at various distances in previous hBOA runs as the basis for introducing bias in future 
hBOA runs. The bias was introduced using prior probabilities of Bayesian network structures. The 
models were thus learned using a combination of the selected population of candidate solutions 
and the prior knowledge extracted from previous hBOA models. The strength of the bias can be 
tuned with a user-defined parameter k > 0. The proposed approach was tested on two challenging 
additively decomposable functions, the NK landscapes with nearest-neighbor interactions and the 
two-dimensional Ising spin glass. The results on 9,000 unique problem instances from the two 
problem classes provided empirical evidence that the proposed approach provides substantial speedups 
across a variety of settings. Specifically, speedups of over 2.26 were achieved for NK landscapes, and 
speedups of over 1.66 were achieved for spin glasses. Furthermore, the speedups for NK landscapes 
grew with problem size and the parameter k, and it can thus be expected that higher speedups can 
be achieved in practice. Preliminary experiments on other problem classes, including MAXSAT 
and minimum vertex cover, indicate that the approach provides substantial benefits also in other 
important classes of additively decomposable functions. 

The results thus reaffirm that one of the key advantages of EDAs is that they provide 
practitioners with a rigorous framework for incorporating prior knowledge and for automated learning 
from solving instances of similar type so that future problem instances can be solved with increased 
speed, accuracy, and reliability. EDAs thus not only allow practitioners to scalably solve problems 
with high levels of epistasis (variable interactions), but they also allow effective inductive transfer 
(transfer learning) in optimization. 

In future work, the approach should be tested on other additively decomposable problems. 
Experiments should also be done to confirm the hypothesis that models obtained on problems 
of one size can be used to bias model building on problems of another size. Furthermore, the 
approach should be adapted to other model-directed optimization techniques, including other EDAs 
and genetic algorithms with linkage learning. The approach should also be modified to introduce 
bias on problems that cannot be formulated using an additive decomposition in a straightforward 
manner. Finally, it is important to study the limitations of the proposed approach, and create 
theoretical models to automatically tune the strength of the bias and predict expected speedups. 

(a) Effects of k on the reduction of the execution (CPU) time, the number of evaluations, the 
number of steps of local search, and the population size. 
(b) Effects of problem size on execution-time speedups. 

Figure 2: Effects of the distance-based bias based on models from hBOA runs on other problem instances 
measured by multiplicative speedups. The speedup for a particular statistic is the ratio of the value of the 
statistic without the bias and its value with the bias. Thus, the greater the speedup, the better the effects 
of the bias. For example, if CPU speedup is 2, hBOA with the distance-based bias is twice as fast as hBOA 
without it in terms of the total execution time. The results for NK landscapes are shown on the left; the 
results for 2D spin glasses are shown on the right. 